[GPU] New QIPC ops for block #684

Open

hughperkins wants to merge 23 commits into hp/new-qipc-ops-subgroup from hp/new-qipc-ops-block

Conversation

@hughperkins (Collaborator) commented May 10, 2026

Summary

Closes the block-level QIPC gaps listed in perso_hugh/doc/qipc/qipc_gaps_block.md by adding three new families of block-scope primitives under qd.simt.block, stacked on the new subgroup primitives branch (hp/new-qipc-ops-subgroup).

  • block.reduce_{add,min,max} / block.reduce_all_{add,min,max} — CUB-style two-stage warp-reduction tree (per-warp subgroup.reduce_* into shared memory, second pass reduces the per-warp partials). _all variants broadcast the result to every thread via shared memory.
  • block.{inclusive,exclusive}_{add,min,max} — CUB-style block scan via warp scans (subgroup.{inclusive,exclusive}_* per warp, exclusive scan of warp aggregates in shared memory, fold prefix back into per-thread results).
  • block.radix_rank_match_atomic_or — CUB's BlockRadixRankMatchEarlyCounts ATOMIC_OR path: per-warp histogram via atomic_add on shared memory, block-wide exclusive prefix over digit columns, intra-warp match via atomic_or on a per-digit lane mask, leader election via clz/popcnt, leader's atomic_add plus subgroup.shuffle broadcast for the warp offset.

Files

  • python/quadrants/lang/simt/block.py — implementations (~370 LoC)
  • tests/python/test_simt.py — 78 new parametrized tests (36 reduce + 36 scan + 6 radix-rank), all arch=qd.gpu
  • docs/source/user_guide/block.md — user-facing docs with API tables, semantics, costs, and examples for each new primitive

Notes / gotchas resolved during development

  • CUDA cache staleness on first build. After pulling the subgroup branch, the qdcache still held kernels built before the shuffle-up change, which surfaced as `LLVMRuntime function subgroupShuffleUp not found`. `rm -rf ~/.cache/quadrants/qdcache/*` fixes it; not a code bug.
  • SPIR-V FindSMsb vs FindUMsb (commit d3e73bab4). leader = 31 - clz(cast(bin_mask, i32)) produced wrong leaders on Vulkan when bin_mask had bit 31 set (e.g., a single-lane match in lane 31): FindSMsb on a negative i32 returns the position of the highest 0-bit, not the 1-bit, so the leader was elected as lane 1 instead of lane 31 and non-leader lanes read warp_offset 0 from a non-leader lane. Passing u32 directly to clz dispatches FindUMsb and matches CUDA's __clz.
  • SPIR-V subgroupBarrier has no memory semantics. OpControlBarrier(ScopeSubgroup, ScopeSubgroup, MemorySemantics=0) does not publish shared-memory writes, unlike CUDA's __syncwarp. radix_rank pairs every warp-sync inside the match phase with subgroup.mem_fence() (_warp_sync_fence helper) so the lane-mask publish/clear sequence has correct release/acquire ordering on Vulkan / Metal.
  • lane_id = invocation_id() was wrong on wave64 (commit bac3f9f41). On CDNA AMDGPU the hardware wave is 64 lanes wide but log2_warp=5 partitions a block into 32-lane logical warps, so invocation_id() == 0 only fires once per hardware wave instead of once per logical warp — every other logical warp's aggregate was silently dropped. Replaced with tid & (WARP_SIZE-1) (CUB's threadIdx.x & 31 recipe). Wave32 backends are unaffected.

Tests

  • CUDA (RTX, single-GPU rtx-high): 82/82 -k block passing
  • Vulkan (RTX, single-GPU rtx-high): 81 passed + 1 skipped, -k block
  • AMDGPU CDNA (MI300X amdcloud): partial — see follow-up below

AMDGPU CDNA wave64 follow-up (not blocking this PR)

block.reduce / block.scan work on AMDGPU only when the block is exactly one logical warp (block_dim == 32 with log2_warp=5). Multi-warp blocks fail because shuffle_down on AMDGPU's ds_bpermute lowering wraps OOB indices modulo the wave size — i.e. on a wave64 wave, shuffle_down(value, 16) from lane 48 reads lane 0 instead of returning the lane's own value (CUDA's __shfl_down_sync semantics with width=32). subgroup.reduce_* / subgroup.{inclusive,exclusive}_* inherit the same constraint; the parent subgroup branch documents log2_size must equal the hardware wave size on CDNA.

block.radix_rank_match_atomic_or is wave32-only by construction: the bin_mask atomic_or + clz + popcnt leader pattern is keyed on a u32 lane mask. Wave64 would need a u64 bin_mask, u64 clz / popcnt, and 64-lane subgroup.shuffle for the warp_offset broadcast.

Both follow-ups are tracked separately and are not in scope for closing the QIPC block gaps (QIPC currently consumes these on CUDA and Vulkan).

Test plan

  • CUDA: QD_WANTED_ARCHS=cuda pytest tests/python/test_simt.py -k block
  • Vulkan: QD_WANTED_ARCHS=vulkan pytest tests/python/test_simt.py -k block
  • AMDGPU: documented as wave32-only / single-warp follow-up
  • Metal: exercised via CI (tests run with `arch=qd.gpu`)

… reduce)

Ports qipc's `qipc/_src/core/block/reduce.py` into Quadrants as
`qd.simt.block.reduce_{add,min,max}` (lane-0-only) and
`qd.simt.block.reduce_all_{add,min,max}` (broadcast).  Closes one of the
three block-tier rows in `qipc_gaps_block.md`.

Strategy is CUB's `BLOCK_REDUCE_WARP_REDUCTIONS`: per-warp `shuffle_down`
tree reduce -> lane 0 of each warp publishes the warp aggregate into a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> thread 0
sequentially folds the warp aggregates with the same operator.  The
`reduce_all_*` flavour adds one extra `block.sync()` plus a one-slot
shared-memory broadcast.  When the block is exactly one warp the
shared-memory path is short-circuited at trace time via an
`impl.static(NUM_WARPS == 1)` guard, so single-warp blocks pay only the
per-warp tree.
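
A plain-Python model of this two-stage shape (illustrative only; the
kernel works on registers and shared memory, not lists):

```python
from functools import reduce as fold

def block_reduce_model(values, op, warp=32):
    # stage 1: each warp tree-reduces its own 32 lanes (shuffle_down tree in-kernel)
    partials = [fold(op, values[w:w + warp]) for w in range(0, len(values), warp)]
    # stage 2: thread 0 folds the per-warp partials published to shared memory
    return fold(op, partials)

assert block_reduce_model(list(range(128)), lambda a, b: a + b) == sum(range(128))
```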

Generic `block.reduce(value, tid, block_dim, log2_warp, op, dtype)` and
`block.reduce_all(...)` accept an arbitrary template binary `op` for
custom monoids; the `_add`/`_min`/`_max` shortcuts wire in the binary
ops already exported by `simt/subgroup.py`.

API mirrors qipc: the caller passes `tid` explicitly (Quadrants does
not yet have a portable in-kernel `block.thread_idx()` on CUDA /
AMDGPU; pass `i % block_dim`).  `block_dim` and `log2_warp` are
`template()` so the call lowers to a fully-unrolled IR with no runtime
branches on warp size.
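
A hedged call-site sketch: the kernel scaffolding below (`qd.loop_config`
iteration, array indexing) is an assumption, and the `reduce_add` argument
order is inferred from the generic signature above:

```python
@qd.func
def block_sums(x, out):
    for i in qd.loop_config(block_dim=256):
        tid = i % 256                                    # block-local index
        s = qd.simt.block.reduce_add(x[i], tid, 256, 5)  # block_dim=256, log2_warp=5
        if tid == 0:                                     # lane-0-only result;
            out[i // 256] = s                            # reduce_all_add broadcasts instead
```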

Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, covering the single-warp
short-circuit (32) and the multi-warp shared-mem path (128, 256).
Reduce-min/max use a permuted (non-monotone) initialiser so the result
genuinely depends on every lane, not just the first or last.  Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.

Stacks on `hp/new-qipc-ops-subgroup` (which provides sized
`subgroup.reduce_*`, `_bin_*`, `shuffle_down`, `invocation_id`).

Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.reduce_{add,min,max}` and
`block.reduce_all_{add,min,max}` sections explain the contract, list
the args, and show a worked example.
… scan)

Ports qipc's `qipc/_src/core/block/scan.py` into Quadrants as
`qd.simt.block.inclusive_{add,min,max}` and
`qd.simt.block.exclusive_{add,min,max}`, plus the generic
`block.{inclusive,exclusive}_scan(..., op, [identity,] dtype)` for
custom monoids.  Closes the second of three block-tier rows in
`qipc_gaps_block.md`.

Strategy is CUB's `BLOCK_SCAN_WARP_SCANS`: per-warp Hillis-Steele scan
via `subgroup.{_inclusive_scan, _exclusive_scan}` -> last lane of
every warp publishes the warp aggregate to a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> every thread
sequentially folds the cross-warp prefix and applies its own warp's
prefix to its scan value.  Cross-warp prefix is computed redundantly
on every thread to avoid a second barrier (same trade-off CUB makes).

Inclusive: warp aggregate at the last lane is just the inclusive
value, written directly.  Exclusive: warp aggregate is recovered as
`op(exclusive[last_lane], value[last_lane])`, since the exclusive
scan does not include the last lane's input.  When the block is
exactly one warp the cross-warp shared-mem path is short-circuited at
trace time.
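
A plain-Python model of the same shape (illustrative only; the kernel's
per-warp scan is a Hillis-Steele shuffle tree, not a sequential loop):

```python
def block_exclusive_scan_model(values, op, identity, warp=32):
    out, cross_prefix = [], identity
    for w in range(0, len(values), warp):
        acc, excl = identity, []
        for v in values[w:w + warp]:             # per-warp exclusive scan
            excl.append(acc)
            acc = op(acc, v)                     # acc ends as op(exclusive[last], value[last])
        out += [op(cross_prefix, e) for e in excl]   # fold this warp's prefix back in
        cross_prefix = op(cross_prefix, acc)     # running exclusive scan of warp aggregates
    return out

vals = [3, 1, 4, 1, 5, 9, 2, 6] * 16             # 128 "threads"
assert block_exclusive_scan_model(vals, lambda a, b: a + b, 0) == [
    sum(vals[:i]) for i in range(len(vals))
]
```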

API mirrors the new block reduce: caller passes `tid` explicitly,
`block_dim` and `log2_warp` are `template()`.  `exclusive_add`
derives the additive identity from `value - value` (matches
`subgroup.exclusive_add`); `exclusive_min` / `exclusive_max` take an
explicit `identity` because no portable type-extreme is derivable
from `value` alone (matches `subgroup.exclusive_min`).

Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, asserting per-thread
against a sequential CPU oracle.  Min / max use a permuted
(non-monotone) input so the scan result genuinely depends on every
prefix step, not just the trailing or leading element.  Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.

Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.inclusive_*` and `block.exclusive_*`
sections explain the contract, identity rules, and the cost profile.
Ports qipc's `qipc/_src/core/sort/_block_radix_rank.py` into Quadrants
as `qd.simt.block.radix_rank_match_atomic_or`.  Closes the third (and
final) row in `qipc_gaps_block.md`.

Faithful port of CUB's
`BlockRadixRankMatchEarlyCounts<WARP_MATCH_ATOMIC_OR>` from
`cub/block/block_radix_rank.cuh` (the SM90 onesweep policy):

  1. Per-warp digit histograms via shared-memory atomic_add.
  2. Per-thread column-sum upsweep across warps (rewrites warp
     histograms into per-warp running prefixes; yields per-thread
     bin_count for digit == tid).
  3. Block exclusive scan on bin_count (uses our newly-added
     `block.exclusive_add`).
  4. Downsweep: fold the block-wide exclusive prefix into every warp's
     offset entry.
  5. Per-key match via shared-memory atomic_or on a per-digit lane
     mask; leader (highest set lane) does a single atomic_add on the
     warp offset and broadcasts via `subgroup.shuffle`; each thread's
     rank = warp_offset + popc(bin_mask & lanemask_le) - 1 (modelled
     in plain Python after this list).
  6. Publish bins + exclusive_digit_prefix to caller-supplied
     SharedArray outparams; one block.sync() retires before exit so
     the caller can read them without an extra barrier.
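
A plain-Python model of step 5's match phase for one warp (illustrative
only; `counters` stands in for the shared-memory warp-offset entries):

```python
def match_phase_model(digits, counters):
    masks = {}
    for lane, d in enumerate(digits):             # atomic_or builds per-digit lane masks
        masks[d] = masks.get(d, 0) | (1 << lane)
    broadcast = {}
    for lane, d in enumerate(digits):
        mask = masks[d]
        if lane == mask.bit_length() - 1:         # leader = highest set lane (31 - clz)
            broadcast[d] = counters[d]            # atomic_add returns the old value...
            counters[d] += bin(mask).count("1")   # ...which subgroup.shuffle broadcasts
    return [
        broadcast[d] + bin(masks[d] & ((1 << (lane + 1)) - 1)).count("1") - 1
        for lane, d in enumerate(digits)          # lanemask_le = (1 << (lane+1)) - 1
    ]

# two colliding digits in a 4-lane toy warp: ranks are stable within each digit
assert match_phase_model([7, 3, 7, 3], {3: 0, 7: 0}) == [0, 0, 1, 1]
```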

Improvements over the qipc port:

  - WARP_SYNC (qipc's `warp.ballot(qd.cast(1, qd.i32))` hack) is
    replaced by `subgroup.sync()` which lowers to `__syncwarp` /
    `OpControlBarrier(ScopeSubgroup, ...)` / `s_barrier` on the
    appropriate backend.
  - CUB's `LaneMaskLe()` PTX intrinsic + qipc's `_lane_mask_le`
    workaround function are replaced by the new portable
    `subgroup.lanemask_le(invocation_id())` (saves ~13 LoC and
    sidesteps the lane==31 overflow special case).
  - `BlockScan::ExclusiveSum` is replaced by the just-added
    `block.exclusive_add`, which drops the qipc kernel's reliance on
    a private sequential cross-warp scan.

API: `radix_rank_match_atomic_or(key, tid, block_dim, log2_warp,
radix_bits, bit_start, num_bits, bins, excl_prefix)` — `key` is one
u32 per thread; `bins` and `excl_prefix` are
`block.SharedArray((1 << radix_bits,), qd.i32)` outparams.  Returns
the per-thread stable rank.  Currently constrained to
`block_dim == 1 << radix_bits`, `log2_warp == 5` (wave32), and
`items_per_thread == 1`; multi-item and wave64 paths are future work.

Tests in `tests/python/test_simt.py` exercise three input
distributions against a CPU oracle:
  - low_entropy: 16 distinct digits each repeated 16 times (heavy
    match-path traffic with several lanes per warp colliding).
  - uniform: full 16-bit uniform random.
  - uniform_high_bits: digit drawn from bits [8, 16) (covers
    bit_start > 0).

The oracle checks bins, excl_prefix, and the per-thread ranks
(uniqueness as a permutation of [0, 256) plus value match).  Run on
`arch=qd.gpu`.

Documentation updated in `docs/source/user_guide/block.md`: support
table grows one row, new section explains the constraints, args,
cost, and shows a worked example.
`subgroup.sync()` lowers to `OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)`
on SPIR-V — i.e. the Memory Semantics operand is **0**.  Bare control
barriers without memory semantics do not publish prior shared-memory
writes to other lanes, so on Vulkan / Metal the radix-rank algorithm
silently saw stale `atomic_or` / `atomic_add` results across the warp,
producing off-by-one ranks (e.g. `actual_ranks[14] == 15` vs
`expected[14] == 14` on the uniform_high_bits pattern).

Wrap the WARP_SYNC pattern in `_warp_sync_fence()` which issues both
`subgroup.sync()` and `subgroup.mem_fence()`.  On CUDA, the explicit
mem_fence is a redundant `__threadfence_block` — slight overhead but
always correct.  On SPIR-V, the mem_fence emits a real
`OpMemoryBarrier(ScopeSubgroup, AcquireRelease | UniformMemory |
WorkgroupMemory)` and restores CUB's `__syncwarp` shared-memory
visibility invariant that the algorithm depends on.
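
A hedged sketch of the helper; the names follow this PR's prose, and the
exact module spelling is an assumption:

```python
def _warp_sync_fence():
    subgroup.sync()       # CUDA __syncwarp / SPIR-V OpControlBarrier(Subgroup, Subgroup, 0)
    subgroup.mem_fence()  # SPIR-V OpMemoryBarrier(Subgroup, AcqRel|UniformMem|WorkgroupMem)
```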

Replaces three `subgroup.sync()` call-sites in
`radix_rank_match_atomic_or`: the histogram-zero retire, the post
`atomic_or` retire, and the leader-clear retire.

The longer-term fix is to emit the proper memory semantics on
`subgroupBarrier` in `spirv_codegen.cpp` (matching the pattern used
by `subgroupMemoryBarrier`), but that lives on the subgroup branch /
needs its own runtime rebuild — handle here in user code so this PR
stays self-contained.
The Vulkan / Metal radix-rank failures were not the SPIR-V
subgroupBarrier-without-memory-semantics quirk after all: pairing
sync() with mem_fence() was correct for general robustness but did not
fix the test.

Real bug: ``leader = 31 - clz(cast(bin_mask, i32))`` triggered SPIR-V's
GLSL.std.450 FindSMsb on the i32, which for negative values (top bit
set) returns the index of the most-significant **0**-bit, not the
1-bit.  Concretely, when only the highest lane in a warp has a given
digit, bin_mask = 0x80000000; FindSMsb on (i32)-2147483648 returns 30
(bit 30 is the highest 0-bit), so the leader was elected as lane
``31 - 30 = 1`` instead of lane 31.  All non-leader lanes in the
match group then read warp_offset from lane 1 (which had stale 0
because lane 1 was *not* the actual leader and never atomic_add'd),
producing the observed last-lane-of-warp off-by-one ranks.

Fix: pass the u32 directly to ``clz`` so SPIR-V dispatches FindUMsb
(which returns MSB-of-1 unconditionally and is already cross-backend on
the subgroup branch).  This matches what CUB's ``__clz`` does on CUDA
and what AMDGPU emits via the new amdgcn-clz lowering.
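
A plain-Python model of the two GLSL.std.450 instructions on this exact
bit pattern (runnable; models only the semantics described above):

```python
def find_umsb(x):                  # FindUMsb: index of the highest 1-bit
    return x.bit_length() - 1

def find_smsb(x):                  # FindSMsb: for negatives, highest 0-bit
    return (~x).bit_length() - 1 if x < 0 else x.bit_length() - 1

bin_mask = 0x80000000              # only lane 31 matched
as_i32 = bin_mask - (1 << 32)      # same bits reinterpreted as i32: -2147483648
assert find_umsb(bin_mask) == 31   # u32 path elects lane 31 correctly
assert find_smsb(as_i32) == 30     # i32 path: bit 30 is the highest 0-bit
```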

Restore the ``_warp_sync_fence`` (subgroup.sync + subgroup.mem_fence)
calls in the match phase — they're orthogonal to the clz fix but
remain the right thing for memory visibility across the warp on
SPIR-V (whose ``OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)``
has no memory semantics; pairing it with an explicit
``OpMemoryBarrier(ScopeSubgroup, AcqRel | UniformMem | WorkgroupMem)``
restores CUDA ``__syncwarp``'s shared-memory visibility invariant).

CUDA was unaffected because ``__clz`` ignores signedness (counts
leading zeros bit-pattern-wise) — the failure only showed up on the
SPIR-V GLSL extended instruction set.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3e73bab45


Comment thread: python/quadrants/lang/simt/block.py (outdated)
WARP_SIZE = impl.static(1 << log2_warp)
NUM_WARPS = impl.static(block_dim // WARP_SIZE)

inclusive = _subgroup_inclusive_scan(value, op, log2_warp)

P2: Preserve operand order for generic scans

For custom block.inclusive_scan/exclusive_scan users with an associative but non-commutative operator, this produces the wrong prefix order: the reused subgroup scan folds the shuffled lower-lane value as the second operand (op(current, previous)) rather than op(previous, current). The docs for these new generic APIs promise op(v[0], ..., v[i]) and "custom monoids", so matrix-multiply-style monoids will return reversed/scrambled prefixes within each warp; the added add/min/max tests don't catch this because those ops are commutative.
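
A plain-Python illustration of the point, using string concatenation as an
associative but non-commutative op:

```python
vals = ["a", "b", "c", "d"]
good, bad = [vals[0]], [vals[0]]
for v in vals[1:]:
    good.append(good[-1] + v)   # op(previous, current): "a", "ab", "abc", "abcd"
    bad.append(v + bad[-1])     # op(current, previous): "a", "ba", "cba", "dcba"
assert good == ["a", "ab", "abc", "abcd"]
assert bad == ["a", "ba", "cba", "dcba"]   # reversed prefixes; add/min/max can't see this
```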


…vocation_id()

On wave32 hardware (CUDA, Metal, RDNA AMDGPU, most Vulkan compute) the
two are equivalent: hardware wave size equals the logical 32-lane warp,
so ``invocation_id()`` runs 0..31 within each warp.

On wave64 hardware (CDNA AMDGPU) the hardware wave is 64 lanes wide
but our log2_warp=5 partitions a block into logical 32-lane warps —
so within one hardware wave we have *two* logical warps (lanes 0..31
and 32..63 of the wave).  ``invocation_id()`` runs 0..63 across that
pair, meaning ``invocation_id() == 0`` only fires once per hardware
wave instead of once per logical warp.  The publish step
``if lane_id == 0: shared[warp_id] = warp_agg`` then skips every odd
logical warp's aggregate, which on a 256-thread block produced
sums missing 4 of 8 warps — exactly the failure mode the local
amdcloud run hit (got 14400 vs expected 32896, the difference being
warps 1,3,5,7).

Replace ``invocation_id()`` with ``tid & (WARP_SIZE-1)`` so the lane
index always means "lane within the logical 32-lane warp".  This
matches CUB's ``threadIdx.x & 31`` recipe on CUDA wave32 and stays
correct when the hardware wave is wider than the logical warp.
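
A plain-Python illustration of the two indexings for a 256-thread block on
wave64 hardware:

```python
WARP_SIZE, WAVE = 32, 64                             # logical warp vs CDNA hardware wave
tids = range(256)
logical_lane = [t & (WARP_SIZE - 1) for t in tids]   # the fix: lane within logical warp
hw_lane = [t % WAVE for t in tids]                   # what invocation_id() returns
assert logical_lane.count(0) == 8                    # all 8 logical warps publish
assert hw_lane.count(0) == 4                         # only warps 0,2,4,6 would publish
```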

Note: this fix alone does *not* make ``block.reduce`` correct on
wave64 with log2_warp=5 — the per-warp ``_warp_reduce`` /
``subgroup.reduce_*`` step still does 5-iteration shuffle_down trees
with offsets 16..1 inside a 64-lane wave, and lane 48's offset-16
shuffle wraps to lane 0 (ds_bpermute address modulo wave bytes) so
warp 1 of every wave gets a contaminated aggregate.  Making block
reduce/scan fully wave64-correct requires either passing log2_warp=6
on CDNA (so the warp size equals the hardware wave) or a width-clip
on shuffle_down — both out of scope for this PR.  Documented in the
test parametrisation; CDNA support is a follow-up.

The radix_rank match phase still uses ``invocation_id()`` because
it deliberately operates on the hardware wave (the bin_mask atomic_or
+ clz + popcnt leader pattern is wave-wide by construction; making it
wave64-correct would need a u64 bin_mask and is also out of scope).

Comment thread: docs/source/user_guide/block.md (outdated)
| `block.SharedArray(shape, dtype)` | yes | yes | yes |
| `block.global_thread_idx()` | yes | yes | — |
| `block.thread_idx()` | no | no | yes |
| `block.reduce_{add,min,max}(v, tid, ...)` | yes | yes | yes |
@hughperkins (Collaborator, Author):

  • why do we need to pass in tid?

Comment thread: docs/source/user_guide/block.md (outdated)
| `block.inclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.exclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.radix_rank_match_atomic_or(...)` | yes | yes | yes |
| `grid.memfence()` (device-scope, see below) | yes | no | no |
@hughperkins (Collaborator, Author):

why is this still in block.md?

Comment thread: docs/source/user_guide/block.md (outdated)

- `value`: per-thread input.
- `tid`: calling thread's block-local index. Pass `i % block_dim` from a `qd.loop_config(block_dim=...)` kernel, or `qd.simt.block.thread_idx()` on backends that expose it.
- `block_dim`: threads per block (compile-time `template()`; must be a multiple of `2**log2_warp`).
@hughperkins (Collaborator, Author):

do we need to pass in block_dim?


@hughperkins force-pushed the hp/new-qipc-ops-block branch from 7f2538b to 009aba4 (May 10, 2026 22:40)
Pulls in the upstream subgroup-branch refresh (block.sync_*_nonzero now
emulated cross-GPU on AMDGPU/Vulkan/Metal, mem_sync renamed to
mem_fence, block.thread_idx ported to CUDA/AMDGPU, the underscore-
prefixed import convention) and merges with the new block reduce / scan
/ radix_rank work on top.

Conflict resolution:

- python/quadrants/lang/simt/block.py: union the two import sets and
  align everything I added on the base branch's underscore convention
  (`func as _func`, `i32 as _i32`, `u32 as _u32`, `ops as _ops`,
  `subgroup as _subgroup`).  All call sites in the new code now go
  through `_ops.atomic_add` / `_ops.cast` / `_ops.clz` /
  `_ops.popcnt` / etc. and `_subgroup.shuffle` /
  `_subgroup.invocation_id` / `_subgroup.lanemask_le` /
  `_subgroup.sync` / `_subgroup.mem_fence`.  No behavioural change.

- docs/source/user_guide/block.md: take the base's wider support table
  (CUDA / AMDGPU / Vulkan / Metal columns) plus the `yes*` cells for
  the now-emulated sync_*_nonzero ops, and add the new rows for
  reduce / scan / radix_rank.  The new rows carry `yes**` for AMDGPU
  with a follow-up footnote explaining the wave32-vs-wave64 log2_warp
  contract; radix_rank specifically is wave32-only and the cell says so.
  Also drop the stale `## Grid-scope fence` heading + body that an
  earlier round had left behind in this file (grid-scope is documented
  in grid.md, not here).

- tests/python/test_simt.py: auto-merged.
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 009aba4 to 351017b (May 10, 2026 22:41)

@hughperkins hughperkins changed the title [feat] Block-level reduce, scan, and radix rank primitives [feat] New QIPC ops for block May 12, 2026
@hughperkins hughperkins changed the title [feat] New QIPC ops for block [GPU] New QIPC ops for block May 12, 2026

… ops

block.reduce*, block.inclusive_*, block.exclusive_*, and
block.radix_rank_match_atomic_or now read the calling thread's block-local
index via block.thread_idx() internally rather than taking it as a parameter.
The change is correctness-neutral and either neutral or a small win
performance-wise: on every backend block.thread_idx() lowers to a single
builtin register read that LLVM / spirv-opt / Metal-compiler CSE reliably,
while non-power-of-two block_dim cases avoid a runtime modulo at the call
site.

Also collapse the redundant _reduce / _reduce_all / _inclusive_block /
_exclusive_block private helpers into the public reduce / reduce_all /
inclusive_scan / exclusive_scan: the prior structure had each public op as
a thin wrapper that only added the now-removed tid parameter, so the
private layer no longer earns its keep.  reduce_add / reduce_min /
reduce_max etc. continue to delegate to the generic reduce / scan ops.
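
A hedged before/after sketch of one call site (argument order is an
assumption; `log2_warp` is still present at this point in the stack and is
removed by the next commit):

```python
# before: caller derives tid and threads it through every call
s = block.reduce_add(v, i % block_dim, block_dim, log2_warp)
# after: the op reads block.thread_idx() internally
s = block.reduce_add(v, block_dim, log2_warp)
```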
…ally

Every block-scope reduction, scan, and the radix-rank primitive previously
took log2_warp as a template parameter, requiring callers to plumb in 5 on
wave32 backends or 6 on wave64.  Now that subgroup.group_size() and
subgroup.log2_group_size() return compile-time Python ints (32 on CUDA /
Metal / Vulkan-on-NVIDIA, 64 on AMDGPU), each block op reads the subgroup
size itself and the parameter is no longer plumbed at the call site.

Adds a compile-time guard (impl.static_assert) at the top of every block op
that block_dim is a positive multiple of subgroup.group_size() -- catches
misconfigurations like block_dim=32 on AMDGPU (wave64) with a clear error
message instead of a silent NUM_WARPS=0 miscompile.

The radix-rank op also gains a compile-time assert that subgroup.group_size()
is 32: the atomic-OR match path is built on 32-lane i32 ballot masks, and
the wave64 path is not yet implemented (parallel to the existing constraint
in the docstring).
The base branch already did this sweep across most user-facing docs and
docstrings; these mentions slipped in after the original cleanup and from
test_atomic.py, which the sweep missed.  No semantic change.
…es Python ints

Inside a @qd.func body, plain Python ops like block_dim % WARP_SIZE get
traced into Quadrants Expressions rather than evaluated as Python ints, so
the new compile-time guards in reduce / scan / radix_rank were failing with
"Static assert with non-static condition" on every backend.

Restoring the impl.static(...) wrappers around the template-int arithmetic
(WARP_SIZE, NUM_WARPS, BLOCK_WARPS, the static_assert condition, SharedArray
shape tuples, and the per-iteration shared-memory indices) forces evaluation
at trace time and the constants reach static_assert / SharedArray as plain
Python ints again, matching the pre-API-cleanup shape.

Pure correctness fix; semantically equivalent to the previous shape.
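
A hedged sketch of the restored pattern (identifier names as they stand at
this point in the stack, before the warp -> subgroup rename):

```python
WARP_SIZE = impl.static(subgroup.group_size())     # plain Python int at trace time
NUM_WARPS = impl.static(block_dim // WARP_SIZE)
impl.static_assert(
    impl.static(block_dim > 0 and block_dim % WARP_SIZE == 0),
    "block.reduce: block_dim must be a positive multiple of subgroup size",
)
partials = block.SharedArray((NUM_WARPS,), dtype)  # shape tuple wants plain ints
```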

…k.py uses it

Mirrors the existing subgroup._inclusive_scan / _exclusive_scan private
helpers in shape -- generic op: template(), log2_size: template(), private
because the generic-op contract is fragile and we don't want to invite
ad-hoc subgroup reductions from arbitrary kernels.  Nothing in subgroup.py
routes through the new helper; the typed reduce_add / reduce_min /
reduce_max keep their existing hard-coded bodies, so no existing IR moves.

block.reduce drops its local _warp_reduce helper and calls
subgroup._reduce instead.  This brings block.py's three per-subgroup steps
into a consistent shape (subgroup._reduce / _inclusive_scan / _exclusive_
scan), removes the duplicated 4-line shuffle tree, and lines up with the
existing convention where block.py only calls into private subgroup
helpers for the generic-op paths.

Identical IR; pure code motion + one call-site rewrite.
block.py is cross-GPU code -- its abstractions (block, subgroup, lane,
shuffle, ballot) come from the SPIR-V / Vulkan / portable subgroup model,
not from CUDA's warp model.  Calling the per-subgroup step a "warp" leaks
CUDA terminology into a portable surface.  This commit sweeps every
cross-GPU identifier and prose mention from "warp" to "subgroup", and
leaves only the genuinely CUDA-specific terms intact (``__syncwarp`` in
comments documenting CUDA's primitive, and the legacy ``qd.simt.warp.*``
test module which is CUDA-only).

block.py identifier renames:
  WARP_SIZE        -> SUBGROUP_SIZE
  WARP_THREADS     -> SUBGROUP_THREADS    (radix-rank local)
  log2_warp        -> log2_subgroup
  NUM_WARPS        -> NUM_SUBGROUPS
  BLOCK_WARPS      -> BLOCK_SUBGROUPS
  warp_id          -> subgroup_id
  warp_idx         -> subgroup_idx
  warp_agg         -> subgroup_agg
  warp_prefix      -> subgroup_prefix
  warp_count       -> subgroup_count
  warp_offset      -> subgroup_offset
  j_warp           -> j_subgroup           (radix-rank loop var)
  _warp_sync_fence -> _subgroup_sync_fence
  CUB stage names: ComputeHistogramsWarp / ComputeOffsetsWarpUpsweep /
  ComputeOffsetsWarpDownsweep -> ...Subgroup / ...SubgroupUpsweep /
  ...SubgroupDownsweep (algorithm-internal names, not exported anywhere).

block.md:
  Deletes the stale ``**`` footnote paragraph that claimed AMDGPU callers
  must pass log2_warp=6 -- the parameter no longer exists; subgroup size
  is read from subgroup.group_size() at compile time, and the same source
  compiles correctly on wave32 and wave64.  Sweeps prose "warp" ->
  "subgroup" in the algorithm descriptions and cost expressions.  Keeps
  "subgroup (warp / wavefront)" in the dual-term glossary line.

test_simt.py:
  Sweeps prose in our block-reduce / block-scan / block-radix-rank
  section headers.  Rewrites the stale ``log2_warp is pinned to 5``
  comment to reflect the current compile-time auto-detection.  Leaves
  the legacy qd.simt.warp.* tests, the segmented-reduce tests, and the
  ``__syncwarp`` mention in test_block_sync untouched.

Pure rename + doc cleanup; identical IR.
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 5a2f552 to 5d1d17e (May 12, 2026 21:00)
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 5d1d17e to f85c12c (May 12, 2026 21:00)
Splits the previously unioned temp-storage into two SharedArrays: smem_offsets
(i32, always) and smem_match (i32 on wave32, i64 on wave64). The match phase
(step 5) is now compile-time-gated by subgroup size: wave32 keeps the existing
32-bit ballot mask + u32 atomic_or / clz / popcnt; wave64 uses a 64-bit ballot
mask + u64 atomic_or (native on AMDGPU LDS) + u64 clz / popcnt, with an inline
u64 lanemask_le (subgroup.lanemask_le is u32-only by contract). Wave32 backends
never see the i64 path, so portability does not depend on SPIR-V / Metal
supporting 64-bit threadgroup atomics.
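
A hedged sketch of the compile-time gate (shared-array shapes and dtype
spellings are assumptions):

```python
if impl.static(subgroup.group_size() == 32):
    # wave32: 32-bit ballot mask + u32 atomic_or / clz / popcnt (existing path)
    smem_match = block.SharedArray((NUM_SUBGROUPS, 1 << radix_bits), qd.i32)
else:
    # wave64: 64-bit ballot mask + u64 atomic_or / clz / popcnt, inline u64 lanemask_le
    smem_match = block.SharedArray((NUM_SUBGROUPS, 1 << radix_bits), qd.i64)
```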
@hughperkins force-pushed the hp/new-qipc-ops-block branch from f85c12c to cd9e546 (May 12, 2026 21:04)

Brings in the AMDGPU RDNA-wave64 cross-half shuffle fix in the subgroup
runtime (permlane64-based), plus log2_size=6 absolute-correctness tests for
subgroup reduce / scan. Unblocks amddesktop validation of the wave64 path
in block.radix_rank_match_atomic_or.
Block reduce / scan / radix-rank impls require ``block_dim`` to be a positive multiple
of the subgroup size (enforced by ``impl.static_assert``).  AMDGPU is wave64 in
Quadrants, so the existing ``block_dim=32`` parametrization is degenerate there: it
fails at compile time with ``block.reduce: block_dim must be a positive multiple of
subgroup size``.

Add ``_skip_if_block_dim_lt_subgroup`` and call it from all 12 block-* dim-parametrized
tests.  CUDA / Vulkan / Metal still cover the full ``[32, 128, 256]`` matrix; AMDGPU
exercises ``[128, 256]``.  Also refresh the section-header comment to reflect the wave64
regime.
Picks up '[subgroup] test: lean parameterization for sized reduce / scan tests'.

…block_dim by arch

Previous parametrization ``block_dim ∈ [32, 128, 256]`` was wave32-shaped: on wave64
``block_dim=32`` was unsupported and got skipped, and the single-subgroup short-circuit
path (BLOCK_SUBGROUPS == 1) was never exercised on AMDGPU because no parameter value
landed at ``block_dim == 64``.

Switch the parameter axis to ``sg_per_block ∈ [1, 4, 8]`` and derive ``block_dim``
inside each test body from a host-side ``_arch_subgroup_size()``.  Each arch now
covers its own canonical regimes:

  wave32 (CUDA / Vulkan-NVIDIA / Metal):  block_dim ∈ [32, 128, 256]
  wave64 (AMDGPU):                        block_dim ∈ [64, 256, 512]

Both arches now exercise the single-subgroup short-circuit + multi-subgroup paths.
The ``_skip_if_block_dim_lt_subgroup`` helper is no longer needed (no degenerate
parameter values reach the kernel) and is removed.
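
A hedged sketch of the new axis (`sg_per_block` values and the
`_arch_subgroup_size()` helper are from this commit; the test body is
illustrative):

```python
import pytest

@pytest.mark.parametrize("sg_per_block", [1, 4, 8])
def test_block_reduce_add(sg_per_block):
    block_dim = sg_per_block * _arch_subgroup_size()  # 32/128/256 wave32, 64/256/512 wave64
    ...
```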
Three assertion-message wrappings updated by black 25.1.0 (the version pinned in
.pre-commit-config.yaml).  Pure formatting; no semantic changes.
…ve64

block.md still said radix_rank requires subgroup_size == 32 and that AMDGPU
(wave64) is unsupported; that's been fixed.  Update the constraints bullet to
reflect the wave32 / wave64 dispatch and refresh the shared-memory footprint
breakdown to call out the wave-size-specific match-mask region (4 KiB i32 on
wave32, 8 KiB i64 on wave64).
